R Exercise 4. Clustering and classification

I have loaded the Boston dataset, which contains information on housing values in the suburbs of Boston. The data has 506 observations and 14 different variables, such as the per-capita crime rate of the town, the number of rooms per dwelling, the pupil-teacher ratio and the proportion of lower-status population. All these variables help to evaluate the value of the houses in the area.

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506  14
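
A minimal sketch of how output like the above can be produced, assuming the data comes from the `MASS` package (the later package messages in this report suggest it does):

```r
# Boston ships with MASS, a recommended package bundled with R
library(MASS)
data("Boston")

str(Boston)  # variable names, types and first values
dim(Boston)  # number of observations and variables
```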

The graphical representation of the variables shows that in some cases there is a strong correlation between variables; also, the observations tend to accumulate close to the edges of the plots.

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

However, I will also plot the correlation matrix in order to explore the data in more detail.

##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
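
A small sketch of how a rounded correlation matrix like the one above can be computed with base R. The 0.7 threshold used to list the strongest pairs is my own illustrative choice, not something taken from the analysis.

```r
library(MASS)
data("Boston")

# Pearson correlations between all variables, rounded to two decimals
cor_matrix <- round(cor(Boston), digits = 2)
cor_matrix

# List each strongly correlated pair once (upper triangle only);
# the 0.7 cut-off is an illustrative assumption
strong <- which(abs(cor_matrix) >= 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_matrix)[strong[, "row"]],
  var2 = colnames(cor_matrix)[strong[, "col"]],
  r    = cor_matrix[strong]
)
strong_pairs
```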

The correlogram gives a more comprehensive picture of the correlations between the variables. One can clearly observe a strong negative correlation between indus/dis (proportion of non-retail business acres per town and weighted mean of distances to five Boston employment centres, -0.71), nox/dis (nitrogen oxides concentration and weighted mean of distances to five Boston employment centres, -0.77), age/dis (proportion of owner-occupied units built prior to 1940 and weighted mean of distances to five Boston employment centres, -0.75) and lstat/medv (lower status of the population and median value of owner-occupied homes, -0.74). A strong positive correlation is observed for indus/nox (proportion of non-retail business acres per town and nitrogen oxides concentration, 0.76), rad/tax (index of accessibility to radial highways and full-value property-tax rate per $10,000, 0.91) and nox/age (0.73). On closer inspection, these correlations seem logical.

However, in order to be able to classify the observations, the variables have to be standardized so that they are comparable.

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
## [1] "matrix"
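
A minimal sketch of the standardization step, assuming base R's `scale()` is used: each column has its mean subtracted and is divided by its standard deviation. Because `scale()` returns a matrix, the sketch converts the result to a data frame so that columns can be accessed with `$` later.

```r
library(MASS)
data("Boston")

# (x - mean(x)) / sd(x) for every column
boston_scaled <- scale(Boston)
summary(boston_scaled)
class(boston_scaled)  # a matrix, not a data frame

# Convert back to a data frame for column access by name
boston_scaled <- as.data.frame(boston_scaled)
```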

After standardization every variable has a mean of zero, and the ranges of the variables have become much narrower and comparable with each other. These standardized variables will therefore be used in the further analysis.

I will also create a new categorical variable, crime, from the continuous variable crim. I will then remove crim from the dataset so that it does not affect the further analysis.

## crime
##      low  med_low med_high     high 
##     1784     1854     1675     1771
## Warning in data.frame(boston_scaled, crime): row names were found from a
## short variable and have been discarded
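
A sketch of one way to build the categorical variable. Using the quartiles of the scaled `crim` column as break points is an assumption on my part; the class labels are the ones shown in the table above. Since this sketch bins only the 506 values of `crim`, its counts will not match the table printed above.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Quartiles of the scaled crime rate as break points (assumed binning rule)
bins <- quantile(boston_scaled$crim)

# include.lowest = TRUE keeps the minimum value inside the first class
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)

# Replace the continuous variable with the categorical one
boston_scaled$crim <- NULL
boston_scaled$crime <- crime
```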

After the necessary transformations, I will divide the data into a train set (containing 80% of the data) and a test set (20% of the data) in order to proceed with linear discriminant analysis.
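
The split can be sketched as follows. The seed value and the quartile-based construction of `crime` are assumptions made only so that the example runs on its own and is reproducible.

```r
library(MASS)
data("Boston")

# Recreate the prepared data so the block runs on its own
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

# Randomly pick 80% of the row indices for training
set.seed(123)  # arbitrary seed, for reproducibility only
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))

train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
```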


Now I will fit the linear discriminant analysis (LDA) on the train set. I will use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. LDA finds the linear combinations of the explanatory variables that best separate the classes of the crime variable.

## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2541027 0.2627493 0.2350450 0.2481031 
## 
## Group means:
##                   zn       indus        chas        nox          rm
## low       0.24825158 -0.24340253 -0.01532391 -0.2261148  0.10701407
## med_low  -0.09325430 -0.15077666 -0.07401979 -0.1210213 -0.01790638
## med_high -0.11209743 -0.02969243  0.03507215 -0.0957422 -0.02303700
## high     -0.08486296  0.44415425  0.06929698  0.4687796 -0.06158741
##                  age         dis        rad        tax      ptratio
## low      -0.25651713  0.22937984 -0.2281897 -0.2499182 -0.225008350
## med_low  -0.03568832  0.05231709 -0.1906708 -0.1855142  0.002298817
## med_high -0.06551581  0.03529415 -0.1828589 -0.1556714 -0.031407746
## high      0.38408606 -0.33543923  0.6087474  0.6006695  0.238350986
##                black       lstat        medv
## low       0.02052373 -0.15301093  0.15569204
## med_low   0.09373194 -0.08531061  0.01445482
## med_high  0.16548201 -0.13614641  0.02514190
## high     -0.33290858  0.37593407 -0.20536690
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.42213641  0.73413008 -0.08391949
## indus    0.15436624  0.02744157 -1.42943840
## chas     0.07603136 -0.01181330 -0.35080816
## nox      0.24203818  0.07430928  0.28917581
## rm       0.17806697  0.09769646 -0.06397810
## age      0.17154013 -0.42563523  0.48406239
## dis      0.12993228 -0.06623937 -0.10790381
## rad      0.54586604  0.33856140 -0.36477229
## tax      0.18870520 -0.52247390  0.34604985
## ptratio  0.13795189 -0.19436834  0.44188529
## black   -0.04277270 -0.36795929 -0.30821605
## lstat    0.36255746  0.62336005  0.46237427
## medv     0.26216018  0.30515033  0.35204524
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.8046 0.1566 0.0388
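
A sketch of the model fit that matches the `lda(crime ~ ., data = train)` call shown above. The data preparation, seed and split are assumptions, so the numbers it prints will differ from the output in this report.

```r
library(MASS)
data("Boston")

# Recreate the prepared data so the block runs on its own
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

set.seed(123)  # arbitrary seed
ind <- sample(nrow(boston_scaled), size = floor(nrow(boston_scaled) * 0.8))
train <- boston_scaled[ind, ]

# crime is the target; every other column is a predictor
lda.fit <- lda(crime ~ ., data = train)
lda.fit
```

With four classes there are at most three linear discriminants, which is why the coefficient table has the columns LD1, LD2 and LD3.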

To get a fuller picture of the results, I have plotted the observations on the linear discriminants, using a different colour for each class of the crime variable. The arrows indicate the impact of each predictor variable in the model.
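
A plot of this kind can be sketched with base graphics. This is an illustrative reconstruction, not the original plotting code; the arrow scaling factor `myscale` is a hypothetical value chosen for readability.

```r
library(MASS)
data("Boston")

# Recreate the prepared data and the model so the block runs on its own
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
set.seed(123)  # arbitrary seed
ind <- sample(nrow(boston_scaled), size = floor(nrow(boston_scaled) * 0.8))
train <- boston_scaled[ind, ]
lda.fit <- lda(crime ~ ., data = train)

# Scores of the training observations on the discriminants
scores <- predict(lda.fit, newdata = train)$x
classes <- as.numeric(train$crime)  # one colour and symbol per class

plot(scores[, "LD1"], scores[, "LD2"], col = classes, pch = classes,
     xlab = "LD1", ylab = "LD2")

# One arrow per predictor, from the origin along its LD1/LD2 coefficients
myscale <- 2  # hypothetical scaling factor for the arrow length
heads <- lda.fit$scaling
arrows(0, 0, myscale * heads[, "LD1"], myscale * heads[, "LD2"],
       length = 0.1, col = "red")
text(myscale * heads[, "LD1"], myscale * heads[, "LD2"],
     labels = rownames(heads), pos = 3, cex = 0.8, col = "red")
```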

Now I will remove the crime variable from the test data and predict the classes for this new dataset.

##           predicted
## correct    low med_low med_high high
##   low      109     121       48   63
##   med_low   83     169       66   62
##   med_high  75     129       54   59
##   high      76      57       42  204
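
The prediction step can be sketched as below. The data preparation and split are assumptions, so the resulting table will not reproduce the counts above.

```r
library(MASS)
data("Boston")

# Recreate the prepared data and the model so the block runs on its own
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
set.seed(123)  # arbitrary seed
ind <- sample(nrow(boston_scaled), size = floor(nrow(boston_scaled) * 0.8))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
lda.fit <- lda(crime ~ ., data = train)

# Save the true classes, then drop them from the test data
correct_classes <- test$crime
test$crime <- NULL

# Predict and cross-tabulate true against predicted classes
lda.pred <- predict(lda.fit, newdata = test)
tab <- table(correct = correct_classes, predicted = lda.pred$class)
tab

# Share of correctly classified test observations
sum(diag(tab)) / sum(tab)
```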

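The cross-tabulation shows that the high crime class is predicted best: 204 of its 379 observations (about 54%) are classified correctly. The predictions for the low and medium classes are less reliable and these classes are often confused with each other; for example, only 54 of the 317 med_high observations are predicted correctly, while 129 of them are assigned to med_low.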

Looking at the data from another angle, I will cluster the observations with the k-means algorithm, which assigns each observation to a cluster based on distances. The distance between two observations is a measure of their similarity: the smaller the distance, the more similar the observations.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.620 170.500 226.300 371.900 626.000
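
A short sketch of computing pairwise distances with `dist()`. It uses the standardized data, which is an assumption, so its summary need not match the values printed above.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Euclidean distance between every pair of observations (the default)
dist_eu <- dist(boston_scaled)
summary(dist_eu)

# Manhattan distance as an alternative measure
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)
```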

The plot shows the scaled variables plotted against each other in pairs.
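
A minimal k-means sketch on the standardized data. The number of centres (3), the seed and the subset of columns shown in the pairs plot are illustrative assumptions; the report does not state which values were used.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)  # k-means starts from random centres, so fix the seed
km <- kmeans(boston_scaled, centers = 3)  # 3 centres is an assumption

# Pairs plot of a few variables, coloured by cluster membership
pairs(boston_scaled[, c("rm", "age", "dis", "rad", "tax")], col = km$cluster)

# Total within-cluster sum of squares for k = 1..10, which can help choose k
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters", ylab = "TWCSS")
```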

##           predicted
## correct    low med_low med_high high
##   low      134     111       41   73
##   med_low   64     189       45   74
##   med_high  61     120       58   80
##   high      72      53       41  201
## [1] 5667   13
## [1] 13  3

Here is another 3D plot, in which the colour is defined by the k-means clusters. It shows the same observations as the LDA model, however without the classification by crime rate.
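
A sketch of how the 3D coordinates can be obtained: the predictor matrix (observations by 13 predictors) is multiplied by the 13 by 3 matrix of discriminant coefficients. The data preparation, the seed and the choice of 4 centres are assumptions, and the `plotly` call only runs if that package is installed.

```r
library(MASS)
data("Boston")

# Recreate the prepared data and the model so the block runs on its own
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
set.seed(123)  # arbitrary seed
ind <- sample(nrow(boston_scaled), size = floor(nrow(boston_scaled) * 0.8))
train <- boston_scaled[ind, ]
lda.fit <- lda(crime ~ ., data = train)

# Predictors only, in the same column order as the rows of lda.fit$scaling
model_predictors <- train[, names(train) != "crime"]
dim(model_predictors)
dim(lda.fit$scaling)

# Project the observations onto LD1, LD2 and LD3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

# Cluster the same observations; 4 centres is an assumption
km <- kmeans(model_predictors, centers = 4)

# Build the 3D scatter plot only when plotly is available
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
                       z = matrix_product$LD3, type = "scatter3d",
                       mode = "markers", color = factor(km$cluster))
  # print(p) displays the plot in an interactive session
}
```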

##           predicted
## correct    low med_low med_high high
##   low      118     107       54   52
##   med_low   73     182       58   63
##   med_high  78     131       56   57
##   high      72      60       31  225